Convolutional Neural Networks

Shan-Hung Wu & DataLab
Fall 2022

In this lab, we introduce two datasets, MNIST and CIFAR-10, and then show how to implement CNN models for them using TensorFlow. The major difference between MNIST and CIFAR-10 is their size. Because memory and training time are limited, we also provide a guide to the typical TensorFlow input pipeline. Let's dive into TensorFlow!

MNIST

We start with a simple dataset. MNIST is a simple computer vision dataset that consists of images of handwritten digits like:

It also includes labels for each image, telling us which digit it is. For example, the labels for the above images are 5, 0, 4, and 1. Each image is 28 pixels by 28 pixels. We can interpret this as a big array of numbers:

The MNIST data is hosted on Yann LeCun's website. We can directly import the MNIST dataset from TensorFlow.

Softmax Regression on MNIST

Before jumping to a Convolutional Neural Network model, we start with a very simple model: a single layer with softmax regression.

We know that every image in MNIST is a handwritten digit between zero and nine, so there are only ten possible digits that a given image can be. We want the model to give the probability that the input image is each digit. That is, given an image, the model outputs a ten-dimensional vector.

This is a classic case where a softmax regression is a natural, simple model. If you want to assign probabilities to an object being one of several different things, softmax is the thing to do.
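To make the idea concrete, here is a minimal plain-Python sketch of the softmax function itself. The ten scores are made-up values standing in for the output of a single linear layer; they are not results from this lab.

```python
import math

def softmax(logits):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp().
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the ten digit classes 0-9 (illustrative values only).
scores = [1.0, 0.5, 0.2, 3.0, 0.1, 0.0, -1.0, 0.3, 0.7, 0.4]
probs = softmax(scores)

# The predicted digit is the class with the highest probability.
predicted_digit = probs.index(max(probs))
```

The class with the largest score always receives the largest probability, so the prediction is simply the index of the maximum.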

From the above result, softmax regression reaches about 92.4% accuracy on MNIST. That is not very good, because we are using a very simple model.

Multilayer Convolutional Network on MNIST

We're now jumping from a very simple model to something moderately sophisticated: a small Convolutional Neural Network. This will get us to over 99% accuracy, not state of the art, but respectable.

Create the convolutional base

As input, a CNN takes tensors of shape (image_height, image_width, color_channels), ignoring the batch size. If you are new to color channels, MNIST has one (because the images are grayscale), whereas a color image has three (R,G,B). In this example, we will configure our CNN to process inputs of shape (28, 28, 1), which is the format of MNIST images. We do this by passing the argument input_shape to our first layer.

Let's display the architecture of our model so far.

Above, you can see that the output of every Conv2D and MaxPooling2D layer is a 3D tensor of shape (height, width, channels). The width and height dimensions tend to shrink as we go deeper in the network. The number of output channels for each Conv2D layer is controlled by the first argument (e.g., 32 or 64). Typically, as the width and height shrink, we can afford (computationally) to add more output channels in each Conv2D layer.
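The shrinking shapes can be checked by hand. The sketch below assumes a specific layer stack that is not spelled out in the text: 3x3 convolutions with stride 1 and no padding, 2x2 max pooling, and the filter counts 32, 64, 64. Under these assumptions the trace ends at the (3, 3, 64) shape mentioned in this lab.

```python
def conv2d_shape(shape, filters, kernel=3):
    """Output shape of a stride-1, unpadded ('valid') convolution."""
    h, w, _ = shape
    return (h - kernel + 1, w - kernel + 1, filters)

def maxpool_shape(shape, pool=2):
    """Output shape of non-overlapping max pooling (extra border rows/columns are dropped)."""
    h, w, c = shape
    return (h // pool, w // pool, c)

# Assumed layer stack: conv(32) -> pool -> conv(64) -> pool -> conv(64).
layers = [("conv", 32), ("pool", 2), ("conv", 64), ("pool", 2), ("conv", 64)]

shape = (28, 28, 1)  # one MNIST image: height, width, channels
trace = [shape]
for kind, arg in layers:
    shape = conv2d_shape(shape, arg) if kind == "conv" else maxpool_shape(shape, arg)
    trace.append(shape)

# Number of values after flattening the final feature map.
flat_size = shape[0] * shape[1] * shape[2]
```

Each unpadded 3x3 convolution removes two rows and two columns, and each 2x2 pooling halves the spatial size, which is why the width and height fall from 28 to 3.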

Add Dense layers on top

To complete our model, we will feed the last output tensor from the convolutional base (of shape (3, 3, 64)) into one or more Dense layers to perform classification. Dense layers take vectors as input (which are 1D), while the current output is a 3D tensor. First, we will flatten (or unroll) the 3D output to 1D, then add one or more Dense layers on top. MNIST has 10 output classes, so we use a final Dense layer with 10 outputs and a softmax activation.

To reduce overfitting, we apply dropout before the readout layer. The idea behind dropout is to train an ensemble of models instead of a single model. During training, we drop each neuron with probability $p$, i.e., we keep it with probability $1-p$. When a neuron is dropped, its output is set to zero, so it contributes to neither the forward pass nor the backward pass of that training step. Each training step therefore trains a network that is slightly different from the previous one, as if we were training a different network at every step. During testing, we do not drop any neurons, so the full network acts roughly like an ensemble of these thinned networks. Randomly dropping units during training also prevents units from co-adapting too much. Dropout is thus a powerful regularization technique for dealing with overfitting.
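The behaviour described above can be sketched in a few lines of plain Python. The rescaling of kept units by 1/(1-p) is the common "inverted dropout" convention and is an assumption here, as are the function name and the fixed random seed.

```python
import random

def dropout(activations, p, training, rng):
    """Zero each activation with probability p during training; identity at test time."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    # Inverted dropout (assumed convention): scale kept units by 1/keep so that
    # the expected value of each activation is the same in training and testing.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)          # fixed seed so the sketch is reproducible
acts = [1.0] * 10000            # dummy activations

train_out = dropout(acts, p=0.5, training=True, rng=rng)
test_out = dropout(acts, p=0.5, training=False, rng=rng)

dropped_fraction = train_out.count(0.0) / len(train_out)
mean_train = sum(train_out) / len(train_out)
```

With p = 0.5, roughly half of the outputs are zero during training and the remaining ones are doubled, so the mean stays close to the test-time value of 1.0.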

Here's the complete architecture of our model.

As you can see, our (3, 3, 64) outputs were flattened into vectors of shape (576) before going through two Dense layers.

Compile and train the model

As you can see, our simple CNN has achieved a test accuracy of 99%. Not bad for a few lines of code! For another style of writing a CNN (using the Keras Subclassing API and a GradientTape) head here.

CIFAR-10

MNIST is an easy dataset for beginners. To demonstrate the power of neural networks, we need a larger dataset: CIFAR-10.

CIFAR-10 consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. Here are the classes in the dataset, as well as 10 random images from each:

Before jumping to a complicated neural network model, we start with KNN and SVM. The motivation is to compare neural network models with traditional classifiers and to highlight the performance of the neural network.

tf.keras.datasets offers convenient facilities that automatically access some well-known datasets. Let's load CIFAR-10 from tf.keras.datasets:

For simplicity, we also convert the images to grayscale. We use the luma coding that is common in video systems:
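A minimal sketch of such a conversion is shown here. It assumes the Rec. 601 luma weights (0.299, 0.587, 0.114), which may differ from the exact coefficients used in the lab, and it works on nested Python lists rather than NumPy arrays.

```python
def rgb_to_gray(image):
    """Convert an H x W image of (R, G, B) tuples to an H x W grayscale image."""
    # Assumed weights: Rec. 601 luma, Y = 0.299 R + 0.587 G + 0.114 B.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in image]

# Tiny 2x2 test image: pure red, pure green, pure blue, and white.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
gray = rgb_to_gray(image)
```

Because the three weights sum to one, white stays at full intensity, while green contributes the most to perceived brightness and blue the least.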

As we can see, the objects in the grayscale images are still recognizable.

Feature Selection

In object detection, HOG (histogram of oriented gradients) is often extracted as a feature for classification. It first calculates the gradients of each image patch using a Sobel filter, then uses the magnitudes and orientations of the derived gradients to form a histogram per patch (a vector). After normalizing these histograms, it concatenates them into one HOG feature. For more details, read this tutorial.
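The per-patch step can be illustrated with a simplified sketch. It is not the lab's getHOGfeat function: it uses central differences instead of a Sobel filter, assumes 9 unsigned-orientation bins over 0 to 180 degrees, and normalizes a single patch on its own.

```python
import math

def patch_histogram(patch, bins=9):
    """Magnitude-weighted histogram of gradient orientations for one grayscale patch."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central differences (a simplification; the text uses a Sobel filter).
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    # L2-normalize so the histogram is less sensitive to overall contrast.
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]

# An 8x8 patch with a vertical edge: dark left half, bright right half.
patch = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
hist = patch_histogram(patch)
```

For this patch all gradients point horizontally, so the whole histogram mass falls into the first orientation bin. A full HOG feature would concatenate such histograms over all patches of the image.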

Note: one can directly feed the original images to the classifier; however, training will take much longer and the performance will be worse.

Once we have our getHOGfeat function, we then get the HOG features of all images.

K Nearest Neighbors (KNN) on CIFAR-10

scikit-learn provides off-the-shelf implementations of common classifiers. For KNN and SVM, we can simply import them from scikit-learn.
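To show what a KNN classifier does conceptually, here is a plain-Python sketch. It is not scikit-learn's implementation, and the two-dimensional feature vectors and labels are toy values rather than HOG features of CIFAR-10 images.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Squared Euclidean distance is enough for ranking the neighbors.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy training set: two well-separated clusters (illustrative values only).
train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = ["cat", "cat", "cat", "ship", "ship", "ship"]

pred_a = knn_predict(train_x, train_y, [0.05, 0.1])
pred_b = knn_predict(train_x, train_y, [1.0, 0.95])
```

KNN has no training phase beyond storing the data, so its quality depends entirely on how well distances in feature space reflect class membership.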

We can observe that the accuracy of KNN on CIFAR-10 is embarrassingly bad.

Support Vector Machine (SVM) on CIFAR-10

CNN on CIFAR-10

As shown above, SVM is slightly better than KNN, but still poor. Next, we'll design a CNN model using TensorFlow.

Although CIFAR-10 is larger than MNIST, it is still small compared with the datasets you will meet in the following lessons. For large datasets, we cannot feed all the training data to the model at once because memory is limited. Even if we could, we would still want data loading to be efficient. An input pipeline is the common way to address both issues.

Input Pipeline

Structure of an input pipeline

A typical TensorFlow training input pipeline can be framed as an ETL process:

  1. Extract: Read data from memory (NumPy) or persistent storage -- either local (HDD or SSD) or remote (e.g. GCS or HDFS).
  2. Transform: Use CPU to parse and perform preprocessing operations on the data such as shuffling, batching, and domain specific transformations such as image decompression and augmentation, text vectorization, or video temporal sampling.
  3. Load: Load the transformed data onto the accelerator device(s) (e.g. GPU(s) or TPU(s)) that execute the machine learning model.

This pattern effectively utilizes the CPU, while reserving the accelerator for the heavy lifting of training your model. In addition, viewing input pipelines as an ETL process provides a framework that facilitates the application of performance optimizations.

tf.data API

To build a data input pipeline with tf.data, here are the steps that you can follow:

  1. Define data source and initialize your Dataset object
  2. Apply transformations on the dataset. The following are some common, useful ones:
    • map
    • shuffle
    • batch
    • repeat
    • prefetch
  3. Create iterator

Construct your Dataset

To create an input pipeline, you must start with a data source. For example, to construct a Dataset from data in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data is stored in a file in the recommended TFRecord format, you can use tf.data.TFRecordDataset().

Once you have a Dataset object, you can transform it into a new Dataset by chaining method calls on the tf.data.Dataset object. For example, you can apply per-element transformations such as Dataset.map(), and multi-element transformations such as Dataset.batch(). See the documentation for tf.data.Dataset for a complete list of transformations.

Now suppose we have simple data sources:

We can create a TensorFlow Dataset object from these two data sources using tf.data.Dataset.from_tensor_slices, which automatically cuts the data into slices:

Apply transformations

Next, according to your needs, you can preprocess your data in this step.

map

For example, Dataset.map() provides customized element-wise data preprocessing.

shuffle

Dataset.shuffle(buffer_size) maintains a fixed-size buffer and chooses the next element uniformly at random from that buffer. This way, the data arrive in a different order in each epoch, which prevents the model from overfitting to the order of the training data.
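The buffer mechanism can be imitated in plain Python. This is only a sketch of the behaviour described above, not the tf.data implementation; the function name and the seeded random generator are assumptions for illustration.

```python
import random

def buffered_shuffle(items, buffer_size, rng):
    """Yield items in a random order drawn from a fixed-size buffer."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # fill the buffer first
            continue
        i = rng.randrange(buffer_size)
        yield buffer[i]              # emit a random buffered element ...
        buffer[i] = item             # ... and replace it with the new one
    while buffer:                    # input exhausted: drain the buffer randomly
        yield buffer.pop(rng.randrange(len(buffer)))

data = list(range(10))
small_buffer = list(buffered_shuffle(data, buffer_size=3, rng=random.Random(0)))
no_shuffle = list(buffered_shuffle(data, buffer_size=1, rng=random.Random(0)))
```

The sketch shows why the buffer size matters: with a buffer of 3, an element can only move a few positions forward, and with a buffer of 1 the order does not change at all. A fully uniform shuffle needs a buffer at least as large as the dataset.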

batch

So far, our dataset yields one example at a time. In practice, we usually want to read one batch at a time, so we call Dataset.batch(batch_size) to stack batch_size elements together.

Note: if you apply Dataset.shuffle after Dataset.batch, the order of the batches is shuffled, but the elements inside each batch stay the same.
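The difference between the two orders can be seen with a plain-Python imitation. The helper below is hypothetical and uses lists instead of a tf.data Dataset; a full shuffle stands in for a shuffle buffer that covers the whole dataset.

```python
import random

def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

data = list(range(8))
rng = random.Random(0)

# Shuffle, then batch: the elements inside each batch change.
shuffled = data[:]
rng.shuffle(shuffled)
shuffle_then_batch = batch(shuffled, 4)

# Batch, then shuffle: only the order of whole batches changes.
batch_then_shuffle = batch(data, 4)
rng.shuffle(batch_then_shuffle)
```

After batching first, the two batches are always [0, 1, 2, 3] and [4, 5, 6, 7] in some order, so the model sees the same groups of examples in every epoch.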

repeat

Repeats this dataset count times.

Dataset.repeat(count) lets you iterate over a dataset for multiple epochs. Setting count to None or -1 makes the dataset repeat indefinitely.

If you would like to perform a custom computation (e.g. to collect statistics) at the end of each epoch then it's simplest to restart the dataset iteration on each epoch:

prefetch

Creates a Dataset that prefetches elements from this dataset.

Dataset.prefetch(buffer_size) lets you decouple the time when data is produced from the time when data is consumed.
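The decoupling idea can be sketched with a background thread and a bounded queue. This is a conceptual analogy in plain Python, not how tf.data implements prefetching, and it omits error handling for brevity.

```python
import queue
import threading

def prefetch(iterable, buffer_size):
    """Produce items in a background thread, keeping up to buffer_size ready in advance."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the input

    def producer():
        for item in iterable:
            q.put(item)      # blocks while the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()       # blocks only if the producer has fallen behind
        if item is sentinel:
            return
        yield item

# The consumer sees the same elements in the same order as without prefetching.
consumed = list(prefetch((i * i for i in range(5)), buffer_size=2))
```

While the consumer is busy with one element (for example, a training step), the producer can already prepare the next ones, so the two kinds of work overlap instead of alternating.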

Consume elements

The Dataset object is a Python iterable. This makes it possible to consume its elements using a for loop:

Or by explicitly creating a Python iterator using iter and consuming its elements using next:

repeat+batch / batch+repeat

The Dataset.repeat transformation concatenates its arguments without signaling the end of one epoch and the beginning of the next epoch. Because of this, a Dataset.batch applied after Dataset.repeat will yield batches that straddle epoch boundaries:

If you need clear epoch separation, put Dataset.batch before the repeat:
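The two orderings can be imitated in plain Python with a five-element dataset and a batch size of 2. The helpers are hypothetical stand-ins for Dataset.repeat and Dataset.batch, and the final short batch is kept rather than dropped.

```python
def repeat(items, count):
    """Concatenate `count` copies of the sequence, with no marker between epochs."""
    return [x for _ in range(count) for x in items]

def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

data = [0, 1, 2, 3, 4]

repeat_then_batch = batch(repeat(data, 2), 2)
# [[0, 1], [2, 3], [4, 0], [1, 2], [3, 4]]  -> the batch [4, 0] straddles two epochs

batch_then_repeat = repeat(batch(data, 2), 2)
# [[0, 1], [2, 3], [4], [0, 1], [2, 3], [4]]  -> each epoch ends with its own short batch
```

Batching before repeating keeps every batch inside a single epoch, at the cost of one smaller batch at the end of each epoch when the dataset size is not a multiple of the batch size.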

shuffle+repeat / repeat+shuffle

As with Dataset.batch the order relative to Dataset.repeat matters.

Dataset.shuffle doesn't signal the end of an epoch until the shuffle buffer is empty. So a shuffle placed before a repeat will show every element of one epoch before moving to the next.

But a repeat before a shuffle mixes the epoch boundaries together.
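A plain-Python imitation makes the difference visible. It assumes a shuffle buffer that covers the whole input (a full shuffle) and that each epoch is reshuffled independently; the helper name and seed are illustrative.

```python
import random

def shuffled(items, rng):
    """Return a fully shuffled copy (stands in for a shuffle buffer >= dataset size)."""
    items = list(items)
    rng.shuffle(items)
    return items

data = list(range(5))
rng = random.Random(0)

# Shuffle, then repeat: each epoch is shuffled on its own, so every element
# appears exactly once before the next epoch starts.
shuffle_then_repeat = [x for _ in range(2) for x in shuffled(data, rng)]

# Repeat, then shuffle: both epochs are pooled before shuffling, so an element
# may appear twice before another element has appeared at all.
repeat_then_shuffle = shuffled(data * 2, rng)
```

In the first ordering, the first five outputs are always a permutation of the dataset. In the second, only the full ten outputs are guaranteed to contain every element twice.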

Now, let's start designing our CNN model!

CNN Model for CIFAR 10

Loading Data Manually

To see how it works under the hood, let's load CIFAR-10 on our own (without using tf.keras). According to the description, the dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class.
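One step of manual loading is turning a flat row of pixel values into an image. The sketch below assumes the layout documented for the Python version of CIFAR-10: each image is a row of 3072 values, with 1024 red values first, then 1024 green, then 1024 blue, each stored row by row. It uses a synthetic row instead of reading the real batch files.

```python
def row_to_image(row, height=32, width=32):
    """Convert a flat [R plane | G plane | B plane] row into H x W (R, G, B) tuples."""
    plane = height * width
    if len(row) != 3 * plane:
        raise ValueError("expected %d values, got %d" % (3 * plane, len(row)))
    return [
        [
            (row[y * width + x], row[plane + y * width + x], row[2 * plane + y * width + x])
            for x in range(width)
        ]
        for y in range(height)
    ]

# Synthetic row: every red value is 1, every green value 2, every blue value 3.
row = [1] * 1024 + [2] * 1024 + [3] * 1024
image = row_to_image(row)
```

With NumPy, the same rearrangement is usually written as a reshape to (3, 32, 32) followed by a transpose to (32, 32, 3).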

We have finished training! Let's see how well our model performs.

Optimization for input pipeline

We all know that GPUs can radically reduce the time required to execute a single training step; however, everything else (including data loading, data transformations, and memory copies from CPU to GPU) is done by the CPU, which sometimes becomes the bottleneck instead. We have learned above that there are many transformations that make datasets more complex and reusable. Now, we are going to accelerate the input pipeline for better training performance, following this guide.

The code below does the same thing as in CNN Model for CIFAR 10. However, we change the dataset structure to show where the time is spent during training.

The dataset pipeline of (dataset_train, dataset_test) is the same as in the CNN Model for CIFAR 10 part. However, if we optimize the pipeline as below, the performance improves. The optimizations include:

  1. Prefetching: overlap the preprocessing and model execution of a training step.
  2. Interleave (parallelizing data extraction): parallelize the data loading step, interleaving the contents of other datasets (such as data file readers).
  3. Parallel mapping: parallelize the mapping across multiple CPU cores.
  4. Caching: cache a dataset to save some operations (like file opening and data reading) from being executed during each epoch.
  5. Vectorizing mapping: batch before map, so that the mapping can be vectorized.
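As one example, the effect of caching can be imitated in plain Python by counting how often the underlying source is read. The two classes are hypothetical illustrations of the idea, not part of tf.data.

```python
class CountingSource:
    """A data source that counts how many elements have been read from it."""
    def __init__(self, n):
        self.n = n
        self.reads = 0

    def __iter__(self):
        for i in range(self.n):
            self.reads += 1   # stands in for an expensive open/read/parse step
            yield i

class Cached:
    """Read the source once, then serve later epochs from memory."""
    def __init__(self, source):
        self.source = source
        self._items = None

    def __iter__(self):
        if self._items is None:
            self._items = list(self.source)
        return iter(self._items)

uncached = CountingSource(4)
cached_source = CountingSource(4)
cached = Cached(cached_source)

for _ in range(3):            # three epochs over each pipeline
    list(uncached)
    list(cached)
```

After three epochs, the uncached source has been read 12 times and the cached one only 4 times. Anything placed before the cache runs once, so random augmentations usually belong after it.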

We recommend studying the terms above in the official documentation. Here we only demonstrate the improvement.

From the results above, the total time drops from 785 to 731 seconds.

The reduction in time does not look large. Although the second epoch in the graph above has no "Open" or "Read" time at all (thanks to caching), the bottleneck here is the training steps rather than I/O. Because we read images from .pkl files, which are binary files with fast I/O, the cost of reading and mapping is much smaller than that of the training steps. However, in a different setting, such as reading images from .jpg/.png files (which is what you will do in the assignment below), the improvement would be evident.

Assignment

In this assignment, you have to implement the input pipeline of the CNN model and try to write/read TFRecord files with the Oregon Wildlife dataset.

We provide the complete code for the image classification task with the CNN model, but with the input pipeline removed. Your task is to complete this part and train the model for at least 5 epochs.

Description of Dataset:

  1. The raw data is from Kaggle and consists of images of 20 classes of wildlife.
  2. We have filtered the raw data. You need to download the filtered images from here and use them to complete the image classification task.
  3. In the dataset we prepared for you, there are nearly 7,200 images, which contain 10 kinds of wildlife.

The sample image is shown below:

red_fox

Requirement:

Notification:

The accuracy on the validation set is now 11.85%, and training takes 1046 seconds. Now try some data augmentation (transformation) and observe whether the accuracy and the execution time increase or decrease.

After trying data augmentation (transformation), it's time to optimize what you did above for better efficiency.